This report will analyze NBA team performance data to determine which factors best predict regular season wins and playoff berths.
This dataset has 90 rows (30 teams across three seasons) and 18 predictor variables. For the models, we will ignore the Year variable, as it is not used as a predictor.
2023-24 NBA team STAT leaders ESPN. Available at: https://www.espn.com/nba/stats/team/_/season/2024/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc
2022-23 NBA team STAT leaders ESPN. Available at: https://www.espn.com/nba/stats/team/_/season/2023/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc
2021-22 NBA team STAT leaders ESPN. Available at: https://www.espn.com/nba/stats/team/_/season/2022/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc
2023-24 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2024_standings.html
2022-23 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2023_standings.html
2021-22 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2022_standings.html
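VARIABLES TO PREDICT WITH
* *PTS*: Points Per Game
* *FGM*: Field Goals Made
* *FGA*: Field Goals Attempted
* *FGPct*: Field Goal Percentage
* *3PM*: 3 Pointers Made
* *3PA*: 3 Pointers Attempted
* *3Pct*: 3 Point Percentage
* *FTM*: Free Throws Made
* *FTA*: Free Throws Attempted
* *FTPct*: Free Throw Percentage
* *OR*: Offensive Rebounds
* *DR*: Defensive Rebounds
* *REB*: Total Rebounds
* *AST*: Assists
* *STL*: Steals
* *BLK*: Blocks
* *TO*: Turnovers
* *PF*: Personal Fouls

VARIABLES WE WANT TO PREDICT
* *W*: Wins
* *Playoffs*: Whether the team made the playoffs (1 = Yes, 0 = No)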
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| W | 14 | 34 | 44 | 41 | 49 | 64 |
| PTS | 103.7 | 110.4 | 113.2 | 113.2 | 115.8 | 123.3 |
| FGM | 37.70 | 40.50 | 41.75 | 41.59 | 42.88 | 47.00 |
| FGA | 83.80 | 86.47 | 88.45 | 88.44 | 90.05 | 94.40 |
| FGPct | 43.00 | 46.12 | 47.00 | 47.03 | 48.08 | 50.70 |
| 3PM | 10.40 | 11.50 | 12.45 | 12.54 | 13.28 | 16.60 |
| 3PA | 28.80 | 32.25 | 34.20 | 34.83 | 36.98 | 43.20 |
| 3Pct | 32.30 | 34.90 | 36.05 | 35.98 | 36.98 | 38.90 |
| FTM | 14.50 | 16.32 | 17.50 | 17.45 | 18.50 | 21.00 |
| FTA | 18.40 | 21.20 | 22.35 | 22.37 | 23.60 | 26.60 |
| FTPct | 71.30 | 76.03 | 78.15 | 78.03 | 79.58 | 83.50 |
| OR | 7.600 | 9.525 | 10.350 | 10.438 | 11.200 | 14.100 |
| DR | 30.10 | 32.40 | 33.30 | 33.37 | 34.17 | 37.50 |
| REB | 38.80 | 42.75 | 43.85 | 43.81 | 45.08 | 49.20 |
| AST | 21.90 | 24.00 | 25.35 | 25.55 | 27.00 | 30.80 |
| STL | 6.100 | 7.025 | 7.400 | 7.467 | 7.800 | 9.800 |
| BLK | 3.000 | 4.425 | 4.700 | 4.842 | 5.200 | 6.600 |
| TO | 11.20 | 12.40 | 13.05 | 13.13 | 13.80 | 15.70 |
| PF | 15.60 | 18.60 | 19.65 | 19.45 | 20.40 | 22.10 |
The dataset summarizes per-game statistics for NBA teams across three seasons. Wins range from 14 to 64 (median 44), while points per game range from 103.7 to 123.3. Field goals made (FGM) range from 37.7 to 47.0 on 83.8 to 94.4 attempts (FGA), with field goal percentages (FGPct) from 43.0% to 50.7%. Three-point shooting varies more widely: teams make 10.4 to 16.6 threes (3PM) on 28.8 to 43.2 attempts (3PA), with percentages (3Pct) from 32.3% to 38.9%. Free throws made (FTM) and attempted (FTA) average 17.5 and 22.4 per game, with free throw percentages (FTPct) between 71.3% and 83.5%. The remaining statistics range as follows: total rebounds (REB) from 38.8 to 49.2, assists (AST) from 21.9 to 30.8, steals (STL) from 6.1 to 9.8, blocks (BLK) from 3.0 to 6.6, turnovers (TO) from 11.2 to 15.7, and personal fouls (PF) from 15.6 to 22.1.
This visualization is a scatterplot matrix displaying the relationship between field goal percentage (FGPct) and the number of wins (W).
Overall, the data suggests that teams with higher field goal percentages tend to have more wins.
This visualization is a bar graph that displays the average Field Goal Percentage, Three Point Percentage, and Free Throw Percentage. On the x axis, 1 denotes teams that made the playoffs and 0 denotes teams that did not. Playoff teams have higher averages in all three percentages, which suggests that better shooting is associated with making the playoffs.
This visualization is a bar graph that displays the average PPG (points per game) for every year in the dataset. The graph shows that the average points per game scored by teams has increased since 2021. This suggests that scoring has become more important than it was in 2021, so teams may need to prioritize high-scoring players in order to maximize their wins.
For this analysis we will use a Linear Regression Model to predict Wins based on the predictors listed below.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| TO | -4.341 | 0.643 | -6.747 | 0.000 |
| STL | 5.001 | 0.835 | 5.988 | 0.000 |
| OR | 17.574 | 11.303 | 1.555 | 0.124 |
| DR | 17.136 | 11.171 | 1.534 | 0.129 |
| PF | 0.709 | 0.476 | 1.490 | 0.141 |
| PTS | -10.959 | 7.795 | -1.406 | 0.164 |
| REB | -13.715 | 11.169 | -1.228 | 0.224 |
| FGM | 15.089 | 14.200 | 1.063 | 0.292 |
| FTM | 11.245 | 12.307 | 0.914 | 0.364 |
| FGPct | 10.047 | 11.429 | 0.879 | 0.382 |
| (Intercept) | -542.916 | 637.618 | -0.851 | 0.397 |
| 3PM | 8.435 | 11.210 | 0.752 | 0.454 |
| 3Pct | 2.153 | 3.364 | 0.640 | 0.524 |
| 3PA | 1.770 | 3.558 | 0.497 | 0.620 |
| FGA | 1.234 | 6.059 | 0.204 | 0.839 |
| AST | 0.078 | 0.516 | 0.150 | 0.881 |
| FTPct | 0.388 | 2.703 | 0.144 | 0.886 |
| BLK | -0.090 | 0.703 | -0.128 | 0.898 |
| FTA | -0.293 | 9.304 | -0.031 | 0.975 |
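This model explains about 85% of the variability in wins (adjusted R-squared), and the only predictors that are significant at an alpha of 0.05 are TO (Turnovers) and STL (Steals). Several predictors are nearly linear combinations of others (for example, REB is essentially the sum of OR and DR), which inflates the standard errors and likely explains why so few coefficients are individually significant. The model is reasonably accurate at predicting the number of wins: the RMSE is 4.46, which means its predictions are typically off by about 4 wins.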
The forward stepwise regression model explains 86.73% of the variance in wins, with an RMSE of about 4.46. Relative to an 82-game NBA season, this means the model does a reasonably good job of predicting a team's wins.
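The predictor variables that are significant at a 0.05 alpha are the following:
* DR (Defensive Rebounds)
* TO (Turnovers)
* FG% (Field Goal Percentage)
* STL (Steals)
* FGA (Field Goals Attempted)
* OR (Offensive Rebounds)
* 3PA (3 Pointers Attempted)
* 3P% (3 Point Percentage)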
Of these variables, the one that seems to have the greatest impact on winning per 1-unit change is defensive rebounds, followed closely by turnovers. This makes sense, as both statistics reflect a change of possession, which creates an opportunity for a team to score.
The model shows a reasonable fit on the training data (Entropy RSquare of 0.3914, Generalized RSquare of 0.5558) but performs significantly worse on the validation data (Entropy RSquare of 0.1183, Generalized RSquare of 0.1994). This suggests potential overfitting. The higher misclassification rate on the validation set (0.3182) compared to the training set (0.1176) also indicates overfitting.
For K=6, the training misclassification rate is 0.32353 (22 misclassifications), while the validation misclassification rate is 0.22727 (5 misclassifications). This suggests that the model performs reasonably well on both the training and validation datasets.
The confusion matrices suggest that while the model performs well in distinguishing between some classes, there is still a significant number of misclassifications, particularly for class 1.
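When looking at Models 1 and 2, which predict whether a team makes the playoffs:
* M1 BootF has a higher sensitivity (77.78%), indicating it is better at correctly identifying positive instances (playoffs = 1). However, it has a higher error rate (31.82%) compared to M2 KNN.
* M2 KNN shows a lower error rate (22.73%), suggesting it is more accurate overall, but it has a lower sensitivity (66.67%).

When looking at Models 3, 4, and 5, which predict the number of wins:
The best model is Model 3 (M3 LinReg)
* It has the highest RSquare
* It has the lowest RASE
* It has the lowest AAE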
After creating and running several models, the best model for predicting the number of wins is Model 3: Linear Regression (M3 LinReg). For predicting whether a team makes the playoffs, the best model depends on the goal. If the priority is correctly identifying teams that will make the playoffs (sensitivity), the best choice is Model 1 (Bootstrap Forest). If the priority is overall accuracy, the best choice is Model 2 (K Nearest Neighbors).
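The variables that are significant at the 95% confidence level (alpha of 0.05) and should be used in further analysis to predict wins are the following:
* DR (Defensive Rebounds)
* TO (Turnovers)
* FG% (Field Goal Percentage)
* STL (Steals)
* FGA (Field Goals Attempted)
* OR (Offensive Rebounds)
* 3PA (3 Pointers Attempted)
* 3P% (3 Point Percentage)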
---
title: "Predicting NBA Team Success"
output:
flexdashboard::flex_dashboard:
vertical_layout: scroll
source_code: embed
---
```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```
```{r load_data}
df <- read_csv("INFO3200_ProjectDatabase.csv")
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=650}
-----------------------------------------------------------------------
### The Problem & Data Collection
#### The Problem
This report will analyze NBA team performance data to determine which factors best predict regular season wins and playoff berths.
#### The Data
This dataset has 90 rows (30 teams across three seasons) and 18 predictor variables. For the models, we will ignore the `Year` variable, as it is not used as a predictor.
#### Data Sources
2023-24 NBA team STAT leaders ESPN. Available at: https://www.espn.com/nba/stats/team/_/season/2024/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc
2022-23 NBA team STAT leaders ESPN. Available at:
https://www.espn.com/nba/stats/team/_/season/2023/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc
2021-22 NBA team STAT leaders ESPN. Available at:
https://www.espn.com/nba/stats/team/_/season/2022/seasontype/2/table/offensive/sort/avgOffensiveRebounds/dir/desc
2023-24 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2024_standings.html
2022-23 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2023_standings.html
2021-22 NBA standings Basketball Reference. Available at: https://www.basketball-reference.com/leagues/NBA_2022_standings.html
### The Data
VARIABLES TO PREDICT WITH
* *PTS*: Points Per Game
* *FGM*: Field Goals Made
* *FGA*: Field Goals Attempted
* *FGPct*: Field goal Percentage
* *3PM*: 3 Pointers Made
* *3PA*: 3 Pointers Attempted
* *3Pct*: 3 Point Percentage
* *FTM*: Free Throws Made
* *FTA*: Free Throws Attempted
* *FTPct*: Free Throw Percentage
* *OR*: Offensive Rebounds
* *DR*: Defensive Rebounds
* *REB*: Total Rebounds
* *AST*: Assists
* *STL*: Steals
* *BLK*: Blocks
* *TO*: Turnovers
* *PF*: Personal Fouls
VARIABLES WE WANT TO PREDICT
* *W*: Wins
* *Playoffs*: Whether the team made the playoffs (1 = Yes, 0 = No)
Data
=======================================================================
Column {data-width=400}
-----------------------------------------------------------------------
### Summary Statistics
```{r, cache=TRUE}
#cache=TRUE lets you re-knit without rerunning every chunk from scratch. If the output does not reflect new updates, choose Knit > Clear Knitr Cache. It can be removed.
#Drop Year (not used in the models) and Playoffs (the classification target)
#so that df holds only W and its 18 numeric predictors
df <- select(df, -Year, -Playoffs)
#Summary statistics for W and each predictor
summary(df)
```
The dataset summarizes per-game statistics for NBA teams across three seasons. Wins range from 14 to 64 (median 44), while points per game range from 103.7 to 123.3. Field goals made (FGM) range from 37.7 to 47.0 on 83.8 to 94.4 attempts (FGA), with field goal percentages (FGPct) from 43.0% to 50.7%. Three-point shooting varies more widely: teams make 10.4 to 16.6 threes (3PM) on 28.8 to 43.2 attempts (3PA), with percentages (3Pct) from 32.3% to 38.9%. Free throws made (FTM) and attempted (FTA) average 17.5 and 22.4 per game, with free throw percentages (FTPct) between 71.3% and 83.5%. The remaining statistics range as follows: total rebounds (REB) from 38.8 to 49.2, assists (AST) from 21.9 to 30.8, steals (STL) from 6.1 to 9.8, blocks (BLK) from 3.0 to 6.6, turnovers (TO) from 11.2 to 15.7, and personal fouls (PF) from 15.6 to 22.1.
Data Viz #1
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Scatterplot & Correlation Between FGPct and Wins
```{r, cache=TRUE}
# Create scatterplot matrix
ggpairs(select(df, FGPct, W))
```
Row
-----------------------------------------------------------------------
### Visualization Summary
This visualization is a scatterplot matrix displaying the relationship between field goal percentage (FGPct) and the number of wins (W).
* Correlation: The correlation coefficient between FGPct and W is 0.584, indicating a moderate positive relationship.
* Scatterplot: The bottom-left scatterplot shows individual data points of FGPct versus Wins, supporting the correlation by displaying a general upward trend.
Overall, the data suggests that teams with higher field goal percentages tend to have more wins.
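
The correlation shown in the plot can also be checked numerically with `cor()` and `cor.test()`. The sketch below uses a small made-up data frame (`toy`, not the project data), so its output will not match the 0.584 reported above; with the real data, the same calls would presumably be applied to `df$FGPct` and `df$W`.

```r
# Hypothetical toy data (NOT the project data): six teams' FG% and wins
toy <- data.frame(
  FGPct = c(44.5, 46.0, 46.8, 47.5, 48.3, 49.9),
  W     = c(22, 35, 41, 38, 50, 57)
)

# Pearson correlation between field goal percentage and wins
r <- cor(toy$FGPct, toy$W)

# Test of H0: the true correlation is zero, with a 95% confidence interval
ct <- cor.test(toy$FGPct, toy$W)

print(round(r, 3))
print(ct$conf.int)
```

The confidence interval is worth reporting alongside the point estimate, since with only 90 team-seasons a moderate correlation still carries noticeable uncertainty.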
Data Viz #2
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Field Goal, 3 Point, and Free Throw % Between Playoff and NonPlayoff Teams

Row
-----------------------------------------------------------------------
### Visualization Summary
This visualization is a bar graph that displays the average Field Goal Percentage, Three Point Percentage, and Free Throw Percentage. On the x axis, 1 denotes teams that made the playoffs and 0 denotes teams that did not. Playoff teams have higher averages in all three percentages, which suggests that better shooting is associated with making the playoffs.
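
A comparison like this one can be computed by averaging each shooting percentage within the playoff and non-playoff groups. The sketch below is one assumed way to build such a summary in base R; the `toy` data frame and its values are made up for illustration and are not the project data.

```r
# Hypothetical toy data (NOT the project data)
toy <- data.frame(
  Playoffs = c(1, 1, 1, 0, 0, 0),
  FGPct    = c(48.1, 47.6, 49.0, 45.9, 46.4, 46.0),
  `3Pct`   = c(37.2, 36.5, 38.0, 35.1, 34.8, 35.6),
  FTPct    = c(79.5, 80.1, 78.8, 77.0, 76.2, 78.1),
  check.names = FALSE  # keep the non-syntactic column name `3Pct` as-is
)

# Mean of each percentage within each Playoffs group (0 = no, 1 = yes)
group_means <- aggregate(
  toy[, c("FGPct", "3Pct", "FTPct")],
  by  = list(Playoffs = toy$Playoffs),
  FUN = mean
)

print(group_means)
```

A difference in group averages shows an association only; it does not by itself establish that higher percentages cause a playoff berth.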
Data Viz #3
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Average Points Per Season From 2021-2023

Row
-----------------------------------------------------------------------
### Visualization Summary
This visualization is a bar graph that displays the average PPG (points per game) for every year in the dataset. The graph shows that the average points per game scored by teams has increased since 2021. This suggests that scoring has become more important than it was in 2021, so teams may need to prioritize high-scoring players in order to maximize their wins.
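
One way to probe this further would be to compute the yearly averages directly and then compare how strongly points relate to wins within each season. The sketch below assumes a data frame that still contains `Year`, `PTS`, and `W`; the `toy` values are made up for illustration and are not the project data.

```r
# Hypothetical toy data (NOT the project data): four teams per season
toy <- data.frame(
  Year = rep(c(2021, 2022, 2023), each = 4),
  PTS  = c(106, 109, 112, 115,  109, 112, 115, 118,  110, 113, 117, 120),
  W    = c(30, 38, 41, 50,      28, 40, 43, 52,      25, 39, 47, 55)
)

# Average points per game in each season
avg_pts <- aggregate(PTS ~ Year, data = toy, FUN = mean)

# Correlation between points and wins within each season
cor_by_year <- sapply(split(toy, toy$Year), function(d) cor(d$PTS, d$W))

print(avg_pts)
print(round(cor_by_year, 3))
```

A rising league-wide average shows that every team is scoring more; a within-season relationship between `PTS` and `W` that strengthens over time would be more direct evidence that scoring matters more for winning.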
Linear Regression Model {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Predict Wins
For this analysis we will use a Linear Regression Model to predict Wins based on the predictors listed below.
```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
W_lm <- lm(W ~ . ,data = df)
summary(W_lm)
```
### Adjusted R-Squared
```{r, cache=TRUE}
ARSq<-round(summary(W_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE
```{r, cache=TRUE}
Sig<-round(summary(W_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------
### Regression Output
```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(W_lm))[,4])
out <- coef(summary(W_lm))[idx,]
knitr::kable(out, digits = 3) #pretty table output
```
Row
-----------------------------------------------------------------------
### Analysis Summary
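This model explains about 85% of the variability in wins (adjusted R-squared), and the only predictors that are significant at an alpha of 0.05 are TO (Turnovers) and STL (Steals). Several predictors are nearly linear combinations of others (for example, REB is essentially the sum of OR and DR), which inflates the standard errors and likely explains why so few coefficients are individually significant. The model is reasonably accurate at predicting the number of wins: the RMSE is 4.46, which means its predictions are typically off by about 4 wins.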
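
The value shown in the RMSE box comes from `summary(W_lm)$sigma`, which is the residual standard error: it divides the residual sum of squares by the residual degrees of freedom rather than by the number of rows. The sketch below illustrates the difference on R's built-in `mtcars` data (not the NBA data); the model formula is only an example.

```r
# Illustration on R's built-in mtcars data (NOT the NBA data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)

# Residual standard error, as returned by summary(fit)$sigma:
# divides by the residual degrees of freedom (n minus number of coefficients)
rse <- sqrt(rss / df.residual(fit))

# In-sample RMSE: divides by the number of rows
rmse <- sqrt(rss / n)

print(c(sigma = summary(fit)$sigma, rse = rse, rmse = rmse))
```

Assuming the wins model was fit to all 90 rows with 19 estimated coefficients, a residual standard error of 4.46 would correspond to an in-sample RMSE of about 4.46 * sqrt(71/90), or roughly 3.96. Either number supports the reading that predictions are typically off by about 4 wins.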
Stepwise Regression Model
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Stepwise Regression Model predicting number of wins

### Analysis Summary
The forward stepwise regression model explains 86.73% of the variance in wins, with an RMSE of about 4.46. Relative to an 82-game NBA season, this means the model does a reasonably good job of predicting a team's wins.
The predictor variables that are significant at a 0.05 alpha are the following:
* DR (Defensive Rebounds)
* TO (Turnovers)
* FG% (Field Goal Percentage)
* STL (Steals)
* FGA (Field Goals Attempted)
* OR (Offensive Rebounds)
* 3PA (3 Pointers Attempted)
* 3P% (3 Point Percentage)
Of these variables, the one that seems to have the greatest impact on winning per 1-unit change is defensive rebounds, followed closely by turnovers. This makes sense, as both statistics reflect a change of possession, which creates an opportunity for a team to score.
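
For readers who want to try forward selection in R, the sketch below shows one possible approach using `step()` from the base `stats` package. It is not necessarily the procedure used for the results above: `step()` selects by AIC, and this report does not state which stopping rule was applied. The example runs on R's built-in `mtcars` data rather than the NBA data.

```r
# Illustration on R's built-in mtcars data (NOT the NBA data)
null_fit <- lm(mpg ~ 1, data = mtcars)  # intercept-only starting model
full_fit <- lm(mpg ~ ., data = mtcars)  # upper bound: all predictors

# Forward selection by AIC: add one predictor at a time while AIC improves
fwd_fit <- step(
  null_fit,
  scope     = list(lower = formula(null_fit), upper = formula(full_fit)),
  direction = "forward",
  trace     = 0  # suppress the step-by-step log
)

print(formula(fwd_fit))
print(summary(fwd_fit)$adj.r.squared)
```

Because selection and evaluation happen on the same rows, the p-values of the retained predictors tend to look better than they would on new data, so a validation set is a useful check.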
Bootstrap Forest Model
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### Bootstrap Forest Model predicting whether a team will/will not make the playoffs.

### Analysis Summary
The model shows a reasonable fit on the training data (Entropy RSquare of 0.3914, Generalized RSquare of 0.5558) but performs significantly worse on the validation data (Entropy RSquare of 0.1183, Generalized RSquare of 0.1994). This suggests potential overfitting.
The higher misclassification rate on the validation set (0.3182) compared to the training set (0.1176) also indicates overfitting.
* The model performs well on training data but not as well on validation data (indicating overfitting)
* The confusion matrices reveal that the model struggles to distinguish between the classes, particularly in predicting the 0 class.
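
To put these rates in perspective, they can be converted back into counts of misclassified teams. The sketch below infers the training and validation set sizes from the K Nearest Neighbors summary in this report (22 errors at a rate of 0.32353, and 5 errors at a rate of 0.22727) and assumes the Bootstrap Forest used the same split; that assumption is not confirmed by the original output.

```r
# Set sizes inferred from the K Nearest Neighbors summary in this report.
# Assumption: the Bootstrap Forest used the same training/validation split.
n_train <- round(22 / 0.32353)
n_valid <- round(5 / 0.22727)

# Implied Bootstrap Forest error counts from the reported rates
err_train <- round(0.1176 * n_train)
err_valid <- round(0.3182 * n_valid)

print(c(n_train = n_train, n_valid = n_valid,
        err_train = err_train, err_valid = err_valid))

# Change in the validation rate caused by one additional misclassified team
print(round(1 / n_valid, 4))
```

Under this assumption the validation set is small, so each additional misclassified team moves the validation rate by several percentage points, and differences between models on this set should be read with caution.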
K Nearest Neighbors Model
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------
### K Nearest Neighbors Model predicting whether a team will/will not make the playoffs.

### Analysis Summary
* For K=6, the training misclassification rate is 0.32353 (22 misclassifications), while the validation misclassification rate is 0.22727 (5 misclassifications). This suggests that the model performs reasonably well on both the training and validation datasets.
* The confusion matrices suggest that while the model performs well in distinguishing between some classes, there is still a significant number of misclassifications, particularly for class 1.
Conclusion
=======================================================================
Column {data-width=500}
-----------------------------------------------------------------------

### Playoff Prediction Accuracy
When looking at Models 1 and 2, which predict whether a team makes the playoffs:
* M1 BootF has a higher sensitivity (77.78%), indicating it is better at correctly identifying positive instances (playoffs = 1). However, it has a higher error rate (31.82%) compared to M2 KNN.
* M2 KNN shows a lower error rate (22.73%), suggesting it is more accurate overall, but it has a lower sensitivity (66.67%).
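
The trade-off between sensitivity and error rate can be made concrete with a small helper that computes both from confusion-matrix counts. The counts below are reconstructed from the reported rates, assuming 22 validation teams of which 9 made the playoffs (7/9 = 77.78%, 6/9 = 66.67%); they are inferred values, not copied from the original model output, and `metrics` is a hypothetical helper name.

```r
# Hypothetical helper: classification metrics from confusion-matrix counts
metrics <- function(tp, fn, fp, tn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    error_rate  = (fp + fn) / (tp + fn + fp + tn))
}

# Counts inferred from the reported validation rates (an assumption)
m1_bootf <- metrics(tp = 7, fn = 2, fp = 5, tn = 8)
m2_knn   <- metrics(tp = 6, fn = 3, fp = 2, tn = 11)

print(round(rbind(M1_BootF = m1_bootf, M2_KNN = m2_knn), 4))
```

Under these assumed counts, the Bootstrap Forest catches one more playoff team than K Nearest Neighbors but mislabels more non-playoff teams, which is why it has the higher overall error rate.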
### Wins Prediction Accuracy
When looking at Models 3, 4, and 5, which predict the number of wins:
The best model is Model 3 (M3 LinReg)
* It has the highest RSquare
* It has the lowest RASE
* It has the lowest AAE
### Analysis Wrap Up
After creating and running several models, the best model for predicting the number of wins is Model 3: Linear Regression (M3 LinReg). For predicting whether a team makes the playoffs, the best model depends on the goal. If the priority is correctly identifying teams that will make the playoffs (sensitivity), the best choice is Model 1 (Bootstrap Forest). If the priority is overall accuracy, the best choice is Model 2 (K Nearest Neighbors).
### Significant Variables
The variables that are significant at the 95% confidence level (alpha of 0.05) and should be used in further analysis to predict wins are the following:
* DR (Defensive Rebounds)
* TO (Turnovers)
* FG% (Field Goal Percentage)
* STL (Steals)
* FGA (Field Goals Attempted)
* OR (Offensive Rebounds)
* 3PA (3 Pointers Attempted)
* 3P% (3 Point Percentage)